#Setting up
Property Value Prediction Model - predictive model of price incorporating structural and locational characteristics - summary statistics w variable descriptions - correlation matrix - 4 home price correlation scatterplots - 1 map of dep variable (sale price) - 3 maps of 3 most interesting independent variables (factors we choose) - testing and training dataset??? - discussion of performance - plot of predicted prices as function of observed prices - map of residuals
The Seattle housing market is a vibrant and complex sector that encapsulates the city’s economic growth and diverse population. Seattle offers a mix of modern high-rise apartments, classic craftsman houses, and waterfront properties, which serve the varied preferences of its residents. The housing prices in Seattle are shaped by an array of factors, including location, architectural style, age of the property, and proximity to amenities like parks, schools, and public transport.
This project evaluates factors affecting housing prices to create an accurate and generalizable OLS regression model to predict them. Housing prices in Seattle have been subject to fluctuations over the years, influenced by various factors such as the neighborhood, property type, and home condition. Sale prediction algorithms are crucial tools that help different stakeholders in the housing market make informed decisions, identify inequalities, and ensure the real estate market’s overall health and fairness.
Building a predictive model for property valuation is challenging, requiring selecting the right features, addressing collinearity among variables, choosing appropriate modeling techniques, and ensuring interpretability. Creating this model was an iterative process that involved multiple rounds of trial and error, particularly in attempts to minimize the errors in predictions. The maps, correlation matrix, and scatterplots illustrate various variables that were considered, wrangled, and filtered as part of the data analysis process.
if (!require(pacman)){install.packages("pacman"); library(pacman)}
p_load (sf, tidyverse, knitr, kableExtra, rmarkdown, tidycensus, dplyr, scales, stringr, ggcorrplot, readxl, data.table, caTools, tmap, stargazer, spdep, caret, ckanr, FNN, grid, gridExtra, jtools, broom, tufte, rmarkdown, brewer.pal, ggplot2)
census_api_key("b3eda1fa84dde3c5ad443fd407d48f2584ab2726", overwrite = TRUE)
Census, OpenData Seattle and provided Dataset
The code below is used to import the foundational dataset for our model. This dataset contains information on home sales prices and property characteristics in Seattle, Kings County for the years 2014 and 2015. The model we are developing is designed to make predictions about home prices and utilizes specific property attributes from this dataset to enhance the accuracy of those predictions.
kingsCounty <- read.csv("kc_house_data.csv")
kingsCoSF <- st_as_sf(kingsCounty, coords = c("long", "lat"), crs = 4326)
seattleTreeCanopy <- read.csv("Seattle_Tree_Canopy_2016_2021_RSE_Census_Tracts.csv")
crs <- st_crs(kingsCoSF)
print(crs)
Getting census tracts data from HUD USPS Crosswalk files to assign census tracts to the original data.
#Data from HUD USPS Zipcode Crosswalk files
hud_data <- read_excel("ZIP_TRACT_032015.xlsx")
desired_zips <- c("98101", "98102", "98103", "98104", "98105", "98106", "98107", "98108",
"98109", "98111", "98112", "98113", "98114", "98115", "98116", "98117",
"98118", "98119", "98120", "98121", "98122", "98123", "98124", "98125",
"98126", "98131", "98132", "98133", "98134", "98135", "98136", "98138",
"98144", "98145", "98146", "98148", "98154", "98155", "98158", "98160",
"98161", "98164", "98166", "98168", "98171", "98174", "98177", "98178",
"98188", "98198", "98199")
#filtering out data that's not Seattle
seattleData <- kingsCoSF %>%
filter(zipcode %in% c("98101","98102","98103","98104","98105", "98106","98107", "98108","98109","98111","98112","98113",'98114","98115"."98116","98117","98118","98119","98120","98121","98122","98123","98124","98125","98126","98131","98132',"98133","98134","98135","98136","98138","98144","98145","98146","98148","98154","98155","98158","98160","98161","98164","98166","98168","98171","98174","98177","98178","98188","98198","98199"))
seattleData$zipcode <- as.character(seattleData$zipcode)
hud_data$ZIP <- as.character(hud_data$ZIP)
#Filtering HUD data for zipcodes of interest
hud_data_filtered <- hud_data[hud_data$ZIP %in% desired_zips, ]
#Merging the filtered HUD dataset with the seattle housing data
seattleData <- merge(seattleData, hud_data_filtered, by.x = "zipcode", by.y = "ZIP", all.x = TRUE)
#removing extra columns
seattleData <- seattleData %>%
select(-OTH_RATIO, -RES_RATIO, -TOT_RATIO, -BUS_RATIO)
seattleData <- seattleData %>%
mutate(TRACT = str_sub(TRACT, -6)) # Keep the last 6 characters
seattleData <- seattleData %>%
filter(yr_built > 0)
#categorizing based on basement presence
seattleData <- seattleData %>%
mutate(basementPresent = case_when(
sqft_basement == 0 ~ 'No Basement',
sqft_basement >= 1 ~ 'Basement'))
#categorizing based on condition
seattleData <- seattleData %>%
mutate(conditionType = case_when(
condition == '1' ~ 'Very Poor',
condition == '2' ~ 'Poor',
condition == '3' ~ 'Average',
condition == '4' ~ 'Good',
condition == '5' ~ 'Very Good'))
seattleData <- seattleData %>%
mutate(houseSize = case_when(
sqft_living >= 2000 ~ 'Big',
sqft_living <= 1999 ~ 'Small'))
seattleDisability <- seattleTreeCanopy %>%
select(PCT_ADULT_WITH_DISABILITIES)
#summary statistics should ideally combine the different datasets
summary(seattleData)
## zipcode id date price
## Length:64080 Min. :1.200e+06 Length:64080 Min. : 78000
## Class :character 1st Qu.:2.011e+09 Class :character 1st Qu.: 306000
## Mode :character Median :3.973e+09 Mode :character Median : 414050
## Mean :4.484e+09 Mean : 502873
## 3rd Qu.:6.909e+09 3rd Qu.: 587812
## Max. :9.900e+09 Max. :7700000
## bedrooms bathrooms sqft_living sqft_lot
## Min. : 0.000 Min. :0.000 Min. : 380 Min. : 520
## 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.: 1240 1st Qu.: 4000
## Median : 3.000 Median :1.750 Median : 1600 Median : 6099
## Mean : 3.191 Mean :1.881 Mean : 1777 Mean : 6987
## 3rd Qu.: 4.000 3rd Qu.:2.500 3rd Qu.: 2150 3rd Qu.: 8410
## Max. :33.000 Max. :8.000 Max. :12050 Max. :137214
## floors waterfront view condition
## Min. :1.000 Min. :0.000000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.000
## Median :1.000 Median :0.000000 Median :0.0000 Median :3.000
## Mean :1.469 Mean :0.008146 Mean :0.2534 Mean :3.445
## 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :3.500 Max. :1.000000 Max. :4.0000 Max. :5.000
## grade sqft_above sqft_basement yr_built
## Min. : 4.000 Min. : 380 Min. : 0.0 Min. :1900
## 1st Qu.: 7.000 1st Qu.:1060 1st Qu.: 0.0 1st Qu.:1933
## Median : 7.000 Median :1320 Median : 0.0 Median :1953
## Mean : 7.295 Mean :1453 Mean : 323.3 Mean :1957
## 3rd Qu.: 8.000 3rd Qu.:1660 3rd Qu.: 630.0 3rd Qu.:1979
## Max. :13.000 Max. :8570 Max. :3480.0 Max. :2015
## yr_renovated sqft_living15 sqft_lot15 TRACT
## Min. : 0.0 Min. : 690 Min. : 651 Length:64080
## 1st Qu.: 0.0 1st Qu.:1330 1st Qu.: 4000 Class :character
## Median : 0.0 Median :1560 Median : 6050 Mode :character
## Mean : 108.2 Mean :1676 Mean : 6611
## 3rd Qu.: 0.0 3rd Qu.:1910 3rd Qu.: 8292
## Max. :2015.0 Max. :4950 Max. :128066
## geometry basementPresent conditionType houseSize
## POINT :64080 Length:64080 Length:64080 Length:64080
## epsg:4326 : 0 Class :character Class :character Class :character
## +proj=long...: 0 Mode :character Mode :character Mode :character
##
##
##
seattleData <- seattleData %>%
mutate(bedroomIndex = case_when(
bedrooms >= 1 ~ 1,
bedrooms < 1 ~ 0))
seattleData$bedroomIndex[is.na(seattleData$bedroomIndex)] <- 0
lm(bedrooms ~ sqft_living, data = seattleData)
##
## Call:
## lm(formula = bedrooms ~ sqft_living, data = seattleData)
##
## Coefficients:
## (Intercept) sqft_living
## 1.8342456 0.0007639
for(i in 1:nrow(seattleData))
{
if(seattleData$bedroomIndex[i] == 0)
{seattleData$bedrooms[i] = 1.9054627 + 0.0007142*seattleData$sqft_living}
}
seattleData$bedrooms <- round(seattleData$bedrooms, digits = 0)
seattleData <- seattleData %>%
mutate(pricePerFt2 = (price/sqft_living)) #might need to do the linear regression bathroom and rooms for this stuff
seattleData <- seattleData %>%
mutate(buildingAge = (2015 -(yr_built)))
The census information from 2021 offers a detailed look at the community and housing in King County, including stuff like how many people live there, what types of homes they have, and income levels. To use this data with housing prices from 2015, we need to remember that things might have changed over those six years, which could affect the home prices in Seattle.
acsVariables2015 <- load_variables(2015, "acs5", cache = TRUE)
# 2021, A
# Retrieve ACS data for Seattle tracts in 2015
kcTracts <- get_acs(
geography = "tract",
variables = c(
"B01003_001", # Total Population
"B19013_001", # Median Household Income
"B25058_001", # Median Rent
"B25008_002", # Owner-Occupied Units
"B25008_003", # Renter-Occupied Units
"B07001_032", # Same House 75 Years Ago
"B07001_017", # Same House 1 Year Ago
"B25088_003", # Median Selected Monthly Owner Costs (homes without a mortgage)
"B25088_002", # Median Selected Monthly Owner Costs (homes with a mortgage)
"B15003_022", # Educational Attainment: Bachelor's Degree
"B17001_002", # Percentage of Population Below the Poverty Level
"B28002_004", # Percentage of Housing Units with High-Speed Internet
"B25044_003", # Percentage of Housing Units with No Vehicle Available
"B02001_002", # Race and Ethnicity: White Alone
"B02001_003", # Race and Ethnicity: Black or African American Alone
"B03001_003" # Hispanic or Latino Origin of Population
),
year = 2021,
state = "WA",#washington
county = "King",#kings county
geometry = TRUE,
output = "wide"
)%>%
select(-NAME, -ends_with("M")) %>%
rename(totalPop = B01003_001E, # Total Population
medIncome = B19013_001E, # Median Household Income
medRent = B25058_001E, # Median Rent
ownerUnits = B25008_002E, # Owner-Occupied Units
renterUnits = B25008_003E, # Renter-Occupied Units
sameHouse75 = B07001_032E, # Same House 75 Years Ago
sameHouse1 = B07001_017E, # Same House 1 Year Ago
medianNoMortgage = B25088_003E, # Median Selected Monthly Owner Costs (homes without a mortgage)
medianMortgage = B25088_002E, # Median Selected Monthly Owner Costs (homes with a mortgage)
bachelors = B15003_022E, # Educational Attainment: Bachelor's Degree
belowPoverty = B17001_002E, # Percentage of Population Below the Poverty Level
highSpeedInternet = B28002_004E, # Percentage of Housing Units with High-Speed Internet
noVehicle = B25044_003E, # Percentage of Housing Units with No Vehicle Available
White = B02001_002E, # Race and Ethnicity: White Alone
Black = B02001_003E, # Race and Ethnicity: Black or African American Alone
hispanicLatino = B03001_003E # Race and Ethnicity: Hispanic or Latino
)
## | | | 0% | |= | 1% | |== | 2% | |=== | 4% | |==== | 6% | |===== | 7% | |====== | 8% | |======= | 9% | |======== | 12% | |========= | 13% | |========== | 14% | |=========== | 15% | |============= | 19% | |============== | 20% | |=============== | 21% | |================ | 22% | |================ | 24% | |================= | 25% | |====================== | 32% | |======================== | 34% | |========================= | 35% | |========================= | 36% | |========================== | 38% | |=========================== | 39% | |============================ | 40% | |================================= | 47% | |================================== | 48% | |================================== | 49% | |=================================== | 50% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 83% | |======================================================================| 100%
# Transform the data to ESRI:102728 projection
kcTracts <- kcTracts %>% st_transform(st_crs(seattleData)) #need to add projection to kc stuff earlier
#also need to filter out the tracts for seattle!!!!
seattleSchools <- read.csv("Seattle_Public_Schools_Sites_2023-2024.csv")
seattleSchools <- st_as_sf(seattleSchools, coords = c("Longitude", "Latitude"), crs = 4326) %>%
st_transform(st_crs(seattleData))
#mapping nearest school
closestSchool <- sf::st_nearest_feature(seattleData, seattleSchools)
#converting into rsgeo geometries
x <- rsgeo::as_rsgeo(seattleData)
y <- rsgeo::as_rsgeo(seattleSchools)
## Calculating distance
seattleData$dist2NearestSchool <- rsgeo::distance_euclidean_pairwise(x, y[closestSchool])
#Police Stations
kcPS <- read.csv("Police_Station_Locations_in_King_County___kcp_loc_point.csv")
seattlePS <- kcPS %>%
filter(ZIPCODE %in% c("98101","98102","98103","98104","98105", "98106","98107", "98108","98109","98111","98112","98113",'98114","98115"."98116","98117","98118","98119","98120","98121","98122","98123","98124","98125","98126","98131","98132',"98133","98134","98135","98136","98138","98144","98145","98146","98148","98154","98155","98158","98160","98161","98164","98166","98168","98171","98174","98177","98178","98188","98198","98199"))
kcPS <- st_as_sf(seattlePS, coords = c("X", "Y"), crs = 4326) %>%
st_transform(st_crs(seattleData))
# Checking CRS of seattleData and seattlePS
crs_seattleData <- st_crs(seattleData)
crs_seattlePS <- st_crs(seattlePS)
print(crs_seattleData)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## ENSEMBLE["World Geodetic System 1984 ensemble",
## MEMBER["World Geodetic System 1984 (Transit)"],
## MEMBER["World Geodetic System 1984 (G730)"],
## MEMBER["World Geodetic System 1984 (G873)"],
## MEMBER["World Geodetic System 1984 (G1150)"],
## MEMBER["World Geodetic System 1984 (G1674)"],
## MEMBER["World Geodetic System 1984 (G1762)"],
## MEMBER["World Geodetic System 1984 (G2139)"],
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ENSEMBLEACCURACY[2.0]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["Horizontal component of 3D system."],
## AREA["World."],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
print(crs_seattlePS)
## Coordinate Reference System: NA
# Checking class of the datasets and ensuring crs is same
print(class(seattleData))
## [1] "sf" "data.frame"
print(class(seattlePS))
## [1] "data.frame"
if (!inherits(seattlePS, "sf")) {
seattlePS <- st_as_sf(seattlePS, coords = c("X", "Y"), crs = 4326)
}
# Transforming seattlePS to match the crs of seattleData
if (!st_crs(seattleData) == st_crs(seattlePS)) {
seattlePS <- st_transform(seattlePS, st_crs(seattleData))
}
#mapping nearest police stations
closestPS <- sf::st_nearest_feature(seattleData, seattlePS)
#converting into rsgeo geometries
x <- rsgeo::as_rsgeo(seattleData)
y <- rsgeo::as_rsgeo(seattlePS)
#calculating distance
seattleData$dist2NearestPS <- rsgeo::distance_euclidean_pairwise(x,y[closestPS])
#Transit Stops
kcTransitStops <- read.csv("Transit_Stops_for_King_County_Metro___transitstop_point.csv")
seattleTransitStops <- kcTransitStops %>%
filter(ZIPCODE %in% c("98101","98102","98103","98104","98105", "98106","98107", "98108","98109","98111","98112","98113",'98114","98115"."98116","98117","98118","98119","98120","98121","98122","98123","98124","98125","98126","98131","98132',"98133","98134","98135","98136","98138","98144","98145","98146","98148","98154","98155","98158","98160","98161","98164","98166","98168","98171","98174","98177","98178","98188","98198","98199"))
kcTransitStops <- st_as_sf(seattleTransitStops, coords = c("X", "Y"), crs = 4326) %>%
st_transform(st_crs(seattleData))
kcTransitStops <- st_transform(kcTransitStops, st_crs(seattleData))
closestTransitStop <- st_nearest_feature(seattleData, kcTransitStops)
seattleData$dist2NearestTransitStop <- st_distance(seattleData, kcTransitStops[closestTransitStop,], by_element = TRUE)
# converting it to character
seattleData$dist2NearestTransitStop <- as.character(seattleData$dist2NearestTransitStop)
# removing '[m]' from the string
seattleData$dist2NearestTransitStop <- sub("\\s\\[m\\]", "", seattleData$dist2NearestTransitStop)
# converting the results back to numeric
seattleData$dist2NearestTransitStop <- as.numeric(seattleData$dist2NearestTransitStop)
kcTracts <- kcTracts %>%
mutate(PctWhite = ((White/totalPop)*100),
PctBlack = ((Black/totalPop)*100),
PctHispanic = ((hispanicLatino/totalPop)*100),
PctBachelors = ((bachelors/totalPop)*100),
PctPoverty = ((belowPoverty/totalPop)*100))
# Filter kctracts to include only the specified GEOID range
kcTracts_filtered <- kcTracts %>%
filter(as.numeric(GEOID) >= 53033000100 & as.numeric(GEOID) <= 53033030100)
#removing extra columns
kcTracts_filtered <- kcTracts_filtered %>%
select(-medRent, -ownerUnits, -renterUnits, -sameHouse75, -sameHouse1, -medianNoMortgage, -medianMortgage, -bachelors, -belowPoverty, -highSpeedInternet, -noVehicle, -Black, -White, -hispanicLatino, -totalPop, -geometry)
# Check the result
head(kcTracts_filtered)
kcTracts_filtered <- kcTracts_filtered %>%
rename(TRACT = GEOID) %>% # Rename GEOID to TRACT
mutate(TRACT = str_sub(TRACT, -6)) # Keep the last 6 characters
# Joining data from two datasets
seattleData <- st_join(seattleData, kcTracts_filtered)
Summary statistics provide a quick snapshot of key attributes of your dataset, such as the average values, variability, and distribution across different variables, which can indicate trends and outliers in your data. They’re essential for understanding the general behavior of the dataset before diving into deeper analysis.
These statistics provide an overview of the central tendency, spread, and range of the internal variables for Seattle homes, which include prices, price per square foot, living area size, year built, number of rooms etc.
InternalVariables <- seattleData
InternalVariables <- st_drop_geometry(InternalVariables)
InternalVariables <- InternalVariables %>%
dplyr::select("price", "pricePerFt2", "sqft_living", "yr_built", "bedrooms", "bathrooms", "condition", "sqft_basement", "sqft_lot")
stargazer(as.data.frame(InternalVariables), type="text", digits=1, title = "Descriptive Statistics for Seattle Homes Internal Variables", out = "Training_seattleInternal.txt")
##
## Descriptive Statistics for Seattle Homes Internal Variables
## =========================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------------
## price 64,080 502,872.6 347,834.7 78,000 7,700,000
## pricePerFt2 64,080 286.9 116.4 87.6 798.1
## sqft_living 64,080 1,776.6 792.4 380 12,050
## yr_built 64,080 1,957.0 31.8 1,900 2,015
## bedrooms 64,080 3.2 1.1 1 33
## bathrooms 64,080 1.9 0.8 0.0 8.0
## condition 64,080 3.4 0.7 1 5
## sqft_basement 64,080 323.3 423.3 0 3,480
## sqft_lot 64,080 6,986.9 6,015.4 520 137,214
## ---------------------------------------------------------
It is observed that the average sale price of homes in Seattle is approximately 502,782.6 USD. The sale prices vary quite widely with a high standard deviation, indicating a more heterogenous market.
The average age for a house in Seattle is approximately 65 years indicating a large share of the market is held by homes constructed a long time ago.
These statistics provide an overview of the central tendency, spread, and range of the internal variables for Seattle homes, which include median income, percentage of white population, percentage of black population, percentage of hispanic population, percentage of population with a bachelor’s degree and percentage of population living in poverty.
DemographicVariables <- seattleData
DemographicVariables <- st_drop_geometry(DemographicVariables)
DemographicVariables <- DemographicVariables %>%
dplyr::select("PctWhite", "PctBlack", "PctHispanic", "PctBachelors", "PctPoverty", "medIncome")
stargazer(as.data.frame(DemographicVariables), type="text", digits=1, title = "Descriptive Statistics for Seattle Homes Demographic Variables", out = "Training_seattleSpatial.txt")
##
## Descriptive Statistics for Seattle Homes Demographic Variables
## =====================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------
## PctWhite 63,970 63.6 18.0 7.3 93.1
## PctBlack 63,970 7.4 7.6 0.0 43.3
## PctHispanic 63,970 10.1 7.4 0.8 39.1
## PctBachelors 63,970 24.9 8.4 2.8 53.4
## PctPoverty 63,970 8.4 5.3 0.0 46.0
## medIncome 63,760 108,185.7 34,358.2 35,684 250,001
## -----------------------------------------------------
The table presents descriptive statistics for demographic variables related to Seattle homes, offering insights into the characteristics of the population in different areas.
On average, approximately 63.6% of the population in the observed areas is White. The range of White population percentages spans from 7.3% to 93.1%, indicating a wide diversity in racial composition across different areas of Seattle. The same does not hold true for Black and Hispanic population percentages indicating the presence of White-majority neighborhoods in different parts of Seattle.
The average median household income in the observed areas is $108,185.2 with approximately 8.4% of the population in the observed areas falls below the poverty level, on average.
These statistics provide an overview of the central tendency, spread, and range of the internal variables for Seattle homes, which include distances to key nearest amenities.
SpatialVariables <- seattleData
SpatialVariables <- st_drop_geometry(SpatialVariables)
SpatialVariables <- SpatialVariables %>%
dplyr::select("dist2NearestSchool", "dist2NearestPS", "dist2NearestTransitStop")
stargazer(as.data.frame(SpatialVariables), type="text", digits=1, title = "Descriptive Statistics for Seattle Homes Spatial Variables (Figure 4.1)", out = "Training_seattleSpatial.txt")
##
## Descriptive Statistics for Seattle Homes Spatial Variables (Figure 4.1)
## ===========================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------------
## dist2NearestSchool 64,080 0.02 0.03 0.000 0.2
## dist2NearestPS 64,080 0.02 0.01 0.001 0.1
## dist2NearestTransitStop 64,080 191.0 183.4 2.8 2,330.9
## -----------------------------------------------------------
These statistics offer insights into the spatial characteristics of Seattle neighborhoods where the “Mean” in each case represents the average distance to a specific amenity like police stations, schools and transit stops from a listed property.
The following code generates a set of scatterplots to explore the relationships between the sale price of homes and various continuous variables. The data is first filtered to exclude observations where the sale_price is less than 5,000,000 to remove any outliers. For the remaining data, a set of scatterplots is created. The scatterplots help visualize the relationships between the sale price of homes and each of the continuous variables, as well as the direction and strength of these relationships. The regression lines indicate whether there is a linear trend in the data.
The scatterplots illustrate that in Seattle, home prices have a positive relationship with square footage and percentage of residents with a bachelor’s degree, suggesting larger homes and more educated neighborhoods tend to have higher property values. Conversely, there’s no clear pattern with the age of the house, indicating that newer does not necessarily mean more expensive in this market.
The correlation matrix provides a numerical summary of how different variables in the housing dataset are related to one another. The correlation matrix indicates that home prices in the Seattle housing dataset are positively correlated with the size of the living area and the number of bathrooms, meaning larger homes and those with more bathrooms tend to be more expensive. There seems to be a slight negative correlation between the price of homes and the distance to the nearest public school and transit stop, implying that homes closer to schools may command higher prices. The zeros in the correlation matrix highlight that there is no linear relationship between the home price and the year built or between the distances to nearest public school and transit stops, which stands out, suggesting that newer homes are not necessarily priced higher and that proximity to these amenities does not uniformly affect home values.
int_variables <- c("price", "condition", "sqft_living", "yr_built", "bedrooms", "bathrooms", "dist2NearestSchool", "dist2NearestPS", "dist2NearestTransitStop", "PctWhite", "PctBachelors", "medIncome")
seattleDatacorr <- seattleData %>%
st_drop_geometry() %>%
select(all_of(int_variables)) %>%
na.omit()
corr <- cor(seattleDatacorr)
rounded_corr <- round(corr, 1)
ggcorrplot(rounded_corr,
type = "lower",
lab = TRUE,
lab_size = 2,
colors = c("#d7191c", "#ffffbf", "#2c7bb6"),
title = "Correlation Matrix of Housing Dataset",
ggtheme = theme_bw())
#making a function for the quintile breaks
qBr <- function(data, var_name) {
# Extract the variable from the dataframe
var <- data[[var_name]]
# Calculate quintiles
quintiles <- quantile(var, probs = seq(0, 1, by = 0.25), na.rm = TRUE)
# Return breaks
return(quintiles)
}
#function for map theme
mapTheme <- function() {
tm_shape() +
tm_layout(
# Customize map background and frame
background.color = "lightblue",
frame = FALSE,
# Customize text style and color
title.text.size = 1.2,
text.color = "darkblue",
# Customize legend appearance
legend.position = c("right", "bottom"),
legend.bg.color = "lightgray",
legend.text.size = 0.8,
legend.title.size = 0.9,
# Add more customizations as needed
)
}
# Creating quintiles from the price variable
seattleData$priceQuintiles <- cut(seattleData$price,
breaks = quantile(seattleData$price, probs = seq(0, 1, by = 0.2)),
include.lowest = TRUE,
labels = FALSE)
# Converting the quintiles into a factor
seattleData$priceQuintiles <- factor(seattleData$priceQuintiles)
#palette
palette <- c('#d7191c', '#fdae61', '#ffffbf', '#abd9e9', '#2c7bb6')
#mapping sale price
ggplot() +
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = seattleData, aes(colour = priceQuintiles)) + # Ensure q5(price) is a factor
scale_color_manual(values = palette, # Use scale_color_manual to match the color aesthetic
labels = qBr(seattleData, "price"),
name = "Quintile Breaks:\nSale Price") + # Correct label text
labs(title = "Properties by Sale Price", subtitle = "Seattle 2015") +
theme_void()
The map shows the distribution of property sale prices across Seattle in
2015, with areas color-coded by quintile to indicate the range of sale
prices. Red indicates the lowest price quintile while blue represents
the highest, illustrating the spatial patterns of housing affordability
or value within the city. This could reflect various socioeconomic
factors, urban development patterns, and housing demand in different
neighborhoods.
The maps visualize the spatial distribution of basements, property conditions and property size across Seattle’s residential properties in 2015.
#palette
palette <- c('#d7191c', '#fdae61', '#ffffbf', '#abd9e9', '#2c7bb6')
#mapping interior condition - works
ggplot()+
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = seattleData, aes(colour =(conditionType)),
show.legend = "point", size = .5) +
scale_colour_manual(values = palette, name = "Condition Type") +
labs(title = "Properties by Condition Type",
subtitle = "Seattle 2015")+
theme_void()
#mapping basement presence
ggplot()+
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = seattleData, aes(colour =(basementPresent)),
show.legend = "point", size = .5) +
scale_colour_manual(values = palette, name = "Basement Presence") +
labs(title = "Properties by Basement Presence",
subtitle = "Seattle 2015")+
theme_void()
#mapping house size
ggplot()+
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = seattleData, aes(colour =(houseSize)),
show.legend = "point", size = .5) +
scale_colour_manual(values = palette, name = "House Size") +
labs(title = "Properties by Size",
subtitle = "Seattle 2015")+
theme_void()
## Mapping Spatial Variables
The map illustrates the distribution of properties around schools in Seattle for the year 2015, segmented into quintiles. Areas with a high concentration of properties in the top quintiles (red and orange) suggest neighborhoods with higher property values near schools, potentially indicating a desirability for educational proximity.
#spatial variables are schools, police stations, and transit stops
# Creating quintiles from the price variable
seattleData$priceQuintiles <- cut(seattleData$price,
breaks = quantile(seattleData$price, probs = seq(0, 1, by = 0.2)),
include.lowest = TRUE,
labels = FALSE)
# Converting the quintiles into a factor
seattleData$priceQuintiles <- factor(seattleData$priceQuintiles)
# Palette
palette <- c("#d7191c", "#fdae61", "#ffffbf", "#abdbe9", "#2c7bb6")
# Plot
ggplot() +
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = seattleData, aes(colour = priceQuintiles), size = 0.75, alpha = 0.3) +
scale_color_manual(values = palette,
labels = c("Quintile 1", "Quintile 2", "Quintile 3", "Quintile 4", "Quintile 5")) +
labs(title = "Properties around schools", subtitle = "Seattle 2015") +
theme_void()
## Mapping Demographic Variables
This map represents the distribution of median household income around properties in Seattle for the year 2015. The color gradation reflects varying income levels, with darker shades indicating higher income brackets. Areas shaded in darker blue, for instance, suggest higher median incomes, which may correlate with more expensive properties or affluent neighborhoods.
#palette
palette <- c('#d7191c', '#fdae61', '#ffffbf', '#abd9e9', '#2c7bb6')
#mapping median income around properties
ggplot() +
geom_sf(data = kcTracts_filtered, fill = "grey89", color = "darkgrey") +
geom_sf(data = kcTracts_filtered, aes(fill = medIncome), color = "transparent", alpha = 0.5) +
scale_fill_gradientn(colors = palette) +
labs(title = "Median Income Around Properties",
subtitle = "Seattle 2015") +
theme_void()
# Regression Model
Regression models are statistical methods used to predict the value of a dependent variable based on one or more independent variables, with training models built on known data for learning and test models applied to evaluate the prediction accuracy on unseen data.
We are splitting the modelling data into training and testing sets, ensuring that the data is split randomly each time.
# Split the dataset into a training set and a test set using stratified sampling
inTrain <- createDataPartition(
y = paste(seattleData$price, seattleData$pricePerFt2, seattleData$sqft_living, seattleData$condition, seattleData$bedrooms, seattleData$buildingAge, seattleData$dist2NearestPS, seattleData$dist2NearestSchool, seattleData$dist2NearestTransitStop, seattleData$medIncome, seattleData$PctWhite, seattleData$PctBlack, seattleData$PctHispanic, seattleData$PctPoverty, seattleData$PctBachelors),
p = .70, list = FALSE) # Create a vector of indices for the training set, taking 70% of observations in the training dataset
# Subset the dataset to create the training set
seattle.training <- seattleData[inTrain,] # Training set
# Subset the dataset to create the test set
seattle.test <- seattleData[-inTrain,] # Test set
# Fitting a linear regression model to predict Sale Price using selected predictors
reg.training <-
lm(price ~ ., data = as.data.frame(seattle.training) %>%
dplyr::select(price, pricePerFt2, sqft_living, condition,
bedrooms, buildingAge, dist2NearestPS, dist2NearestSchool, dist2NearestTransitStop,
medIncome, PctWhite, PctBlack, PctHispanic, PctPoverty, PctBachelors))
summary(reg.training)
##
## Call:
## lm(formula = price ~ ., data = as.data.frame(seattle.training) %>%
## dplyr::select(price, pricePerFt2, sqft_living, condition,
## bedrooms, buildingAge, dist2NearestPS, dist2NearestSchool,
## dist2NearestTransitStop, medIncome, PctWhite, PctBlack,
## PctHispanic, PctPoverty, PctBachelors))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1319236 -36749 9211 44612 3082523
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -593918.97406 6704.09904 -88.590
## pricePerFt2 1910.18144 6.25011 305.624
## sqft_living 342.44773 0.84285 406.297
## condition 883.19894 818.67316 1.079
## bedrooms -7844.99888 569.25939 -13.781
## buildingAge -44.11444 18.43487 -2.393
## dist2NearestPS 260679.58601 46594.99661 5.595
## dist2NearestSchool -86033.24372 23493.09476 -3.662
## dist2NearestTransitStop -4.01425 3.22056 -1.246
## medIncome 0.21461 0.02394 8.966
## PctWhite -225.26706 63.62653 -3.540
## PctBlack -338.08453 124.75508 -2.710
## PctHispanic 929.91315 104.29391 8.916
## PctPoverty -49.26588 140.29131 -0.351
## PctBachelors -2145.72695 105.50078 -20.338
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## pricePerFt2 < 0.0000000000000002 ***
## sqft_living < 0.0000000000000002 ***
## condition 0.28067
## bedrooms < 0.0000000000000002 ***
## buildingAge 0.01672 *
## dist2NearestPS 0.0000000222 ***
## dist2NearestSchool 0.00025 ***
## dist2NearestTransitStop 0.21261
## medIncome < 0.0000000000000002 ***
## PctWhite 0.00040 ***
## PctBlack 0.00673 **
## PctHispanic < 0.0000000000000002 ***
## PctPoverty 0.72546
## PctBachelors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114000 on 47643 degrees of freedom
## (238 observations deleted due to missingness)
## Multiple R-squared: 0.8936, Adjusted R-squared: 0.8935
## F-statistic: 2.857e+04 on 14 and 47643 DF, p-value: < 0.00000000000000022
## Plot regression
#effect_plot(reg2, pred = number_of_bathrooms, interval = TRUE, plot.points = TRUE)
plot_summs(reg.training, scale = TRUE)
# Make predictions on the test set and evaluate model performance
seattle.test <-
seattle.test %>% # Pipe the test set into the following operations
# Add a column indicating the type of regression model used
mutate(Regression = "Baseline Regression",
# Predict sale prices using the trained regression model
SalePrice.Predict = predict(reg.training, seattle.test),
# Calculate the difference between predicted and actual sale prices
SalePrice.Error = SalePrice.Predict - price,
# Calculate the absolute difference between predicted and actual sale prices
SalePrice.AbsError = abs(SalePrice.Predict - price),
# Calculate the absolute percentage error
SalePrice.APE = (abs(SalePrice.Predict - price)) / price) %>%
filter(price < 5000000) # Filter out records with price greater than $5,000,000
The mean absolute error (MAE) for our model is currently
approximately $64,798, and the Mean Absolute Percentage
Error (MAPE) stands at around 17%. We experimented with various variable
combinations to reduce these metrics, and we managed to achieve results
close to the range of $50,000 to $100,000 (with a MAPE of 17%) by
incorporating variables like “pricePerFt2”. Variables like price per
square foot (pricePerFt2) and educational attainment (PctBachelors) are
significant predictors, implying a strong relationship between these
factors and housing prices.
mean(seattle.test$SalePrice.AbsError, na.rm = T) #MAE is the mean absolute error
## [1] 64798.57
mean(seattle.test$SalePrice.APE, na.rm = T)
## [1] 0.1709712
We used 100 folds for cross-validation. The model shows strong predictive power with an R-squared of 0.9, indicating that 90% of the variability in the housing prices can be explained by the model’s predictors. However, there is notable variation in the model’s performance across different cross-validation folds, as seen by the std deviation and average values in RMSE and MAE. The range of Mean Absolute Error (MAE) from around 57,980 to 78,327 highlights variability in prediction accuracy, which suggests that while the model explains much of the variance, its predictive performance could be inconsistent across different data samples. The Root Mean Square Error (RMSE) provides a sense of the magnitude of prediction errors. With an average RMSE of 112,359.1, the model’s predictions deviate from the actual values by this amount on average. The model may be effective for the dataset used, given the high R-squared value, but the RMSE suggests that the actual prediction errors could be substantial, affecting the model’s practical application in predicting housing prices.
seattle.test <-
seattle.test %>%
filter(price < 5000000)
# Set up cross-validation for model evaluation
fitControl <- trainControl(method = "cv", number = 100)
# Set the seed for reproducibility
set.seed(825) #seed makes sure that you get random numbers, so you do not get very different answers every time. This is done to make sure analysis is robust.
# Train a linear regression model using cross-validation
reg.cv <-
train(price ~ .,
data = st_drop_geometry(seattleData) %>%
dplyr::select(price, pricePerFt2, sqft_living, condition,
bedrooms, buildingAge, dist2NearestPS, dist2NearestSchool, dist2NearestTransitStop,
medIncome, PctWhite, PctBlack, PctHispanic, PctPoverty, PctBachelors),
method = "lm", # Specify the modeling method as linear regression
trControl = fitControl, # Specify the cross-validation settings
na.action = na.pass) # Specify how to handle missing values
stargazer(as.data.frame(reg.cv$resample), type="text", digits=1, title="Cross Validation Results", out = "CV.txt") #all cv
##
## Cross Validation Results
## ===================================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------------
## RMSE 100 112,359.1 18,263.0 89,048.0 176,096.0
## Rsquared 100 0.9 0.02 0.9 0.9
## MAE 100 66,010.9 3,915.9 57,980.8 78,327.6
## ---------------------------------------------------
# View the results of the cross-validated linear regression model
reg.cv
## Linear Regression
##
## 64080 samples
## 14 predictor
##
## No pre-processing
## Resampling: Cross-Validated (100 fold)
## Summary of sample sizes: 63441, 63439, 63440, 63440, 63440, 63438, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 112359.1 0.8974212 66010.95
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Most of our MAE seem to be within the $57,000 to $80,000 range (with a mean of $66,000) which is illustrated in the following histogram.
#plotting the cross validation stuff
ggplot(reg.cv$resample, aes(x=MAE)) +
geom_histogram(fill = "#2c7bb6") +
labs(title = "Cross Validation Tests in Mean Average Error") +
theme_minimal()
The histogram visualizes the distribution of the Mean Absolute Error
(MAE) from cross-validation tests of the regression model. The majority
of the MAE values cluster around the 65000 to 70000 range, indicating
that the model’s prediction errors typically fall within this band.
Fewer instances of the model exhibit significantly higher errors, with
some outliers around 75000.
seattle.test %>%
dplyr::select(SalePrice.Predict, price) %>%
ggplot(aes(price, SalePrice.Predict)) +
geom_point() +
stat_smooth(aes(price, price),
method = "lm", se = FALSE, size = 1, colour="#d7191c") +
stat_smooth(aes(SalePrice.Predict, price),
method = "lm", se = FALSE, size = 1, colour="#2c7bb6") +
labs(title="Predicted Sale Price as a Function of Observed Price",
subtitle="Red line represents a perfect prediction; Blue line represents prediction",
x = "Observed Price",
y = "Predicted Price") +
theme_minimal()+ xlim(0, 2000000) + ylim(0, 2000000)
The blue line, representing the model’s predictions, closely follows the red line, indicating that the model is performing well for many of the data points, particularly at lower price ranges. However, as the observed price increases, the predictions tend to deviate from the perfect line, showing some underestimation or overestimation by the model. The clustering of points along the blue line also suggests greater variability in predictions as the house price increases, which could be due to factors not captured by the model or the impact of higher-value outliers.
In spatial correlation analysis, filtering for values greater than 0 can focus on areas where the prediction error indicates a consistent underestimation of property values, which might reveal spatial patterns or trends. This can help identify specific regions where the model’s performance is lacking due to possible local factors or spatial dependencies not captured in the model.
#use the same thing for weights and morans i
# Spatial Lag for price
# Extract the coordinates from the 'seattleData' spatial dataframe
coords <- st_coordinates(seattleData)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
seattleData$lagPrice <- lag.listw(spatialWeights, seattleData$price)
# Extract the coordinates of the test dataset
coords.test <- st_coordinates(seattle.test) #calculating spatial lag of errors to see if our model is misspecified
# Create a neighbor list using k-nearest neighbors (KNN) with k=5 for the test dataset
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
# Convert the neighbor list to a spatial weights matrix for the test dataset
spatialWeights.test <- nb2listw(neighborList.test, style="W")
seattle.test$lagPrice <- lag.listw(spatialWeights.test, seattle.test$price)
seattle.test$lagPriceError <- lag.listw(spatialWeights.test, seattle.test$SalePrice.Error, NAOK = TRUE)
# Filtering greater than 0 values
seattle.test_filter <- seattle.test %>%
filter(SalePrice.Error > 0, lagPriceError > 0)
ggplot(data = seattle.test_filter, aes(lagPriceError, SalePrice.Error)) +
geom_point(size = .85,colour = "black") +
geom_smooth(method = "lm",colour = "red",size = 1.2) +
labs(title="Price Errors") +
theme_minimal()
The graph suggests that there is a positive relationship between sale
price error and lag price error, indicating that neighborhoods with
higher sale price predictions tend to also have neighboring areas with
similar prediction errors.
Moran’s I is a measure of spatial autocorrelation, meaning it tells us whether the pattern of a certain variable (like housing prices) is random or clustered across a geographic area. This suggests that there is no significant clustering of errors, a pattern that is also evident in the map. Consequently, it is probable that errors in our model may stem from other demographic or internal variables that we may not have accounted for. To improve the accuracy of our analysis, further investigation and inclusion of these potentially omitted variables might be necessary.
# for morans 1
# Spatial Lag for price
coords <- st_coordinates(seattle.test_filter)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
seattle.test_filter$lagPrice <- lag.listw(spatialWeights, seattle.test_filter$price)
seattle.test_filter$lagPriceError <- lag.listw(spatialWeights, seattle.test_filter$SalePrice.Error, NAOK = TRUE)
moranTest <- moran.mc(na.omit(seattle.test_filter$SalePrice.Error),
spatialWeights, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.005) +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#d7191c",size=1) +
scale_x_continuous(limits = c(-0.5,0.5)) +
labs(title="Observed and permuted Moran's I",
subtitle= "Observed Moran's I in red",
x="Moran's I",
y="Count") +
theme_minimal()
The histogram shows the expected distribution of Moran’s I if there was no spatial autocorrelation, based on random permutations. The red line stands out on the right, suggesting the observed Moran’s I is significantly higher than what would be expected if there were no spatial pattern. This means that the variable being analyzed is not randomly distributed but, instead, is spatially clustered—neighboring areas are likely to have similar values.
We are trying to assess how errors are distributed by neighborhoods (in this case, census tracts).
# Convert the 'seattle.test' dataset to a regular data frame
seattle.test %>%
as.data.frame() %>%
# Group the data frame by the 'TRACT.x' variable
group_by(TRACT.x) %>%
# Calculate the mean of 'SalePrice.Predict' and 'SalePrice' variables within each group
summarize(meanPrediction = mean(SalePrice.Predict),
meanPrice = mean(price)) %>%
# Format the summarized data frame as an HTML table
kable() %>%
# Apply styling to enhance the appearance of the HTML table
kable_styling()
| TRACT.x | meanPrediction | meanPrice |
|---|---|---|
| 000100 | 408396.5 | 412273.0 |
| 000200 | 455614.6 | 435950.0 |
| 000300 | 386338.2 | 388143.9 |
| 000401 | 495899.1 | 495161.1 |
| 000402 | 507514.8 | 507376.0 |
| 000500 | 687182.7 | 661347.0 |
| 000600 | 384418.5 | 389072.2 |
| 001200 | 386912.4 | 390688.0 |
| 001300 | 490248.7 | 484662.9 |
| 001400 | 480903.1 | 487999.6 |
| 001500 | 642996.8 | 621617.1 |
| 001600 | 607865.7 | 588014.3 |
| 001701 | 481826.6 | 473885.2 |
| 001702 | 542028.3 | 535987.6 |
| 001800 | 615803.4 | 599850.9 |
| 001900 | 483361.8 | 482372.4 |
| 002700 | 608407.7 | 595790.8 |
| 002800 | 593094.3 | 583963.9 |
| 002900 | 611281.1 | 594322.2 |
| 003200 | 636217.2 | 609013.9 |
| 003300 | 622431.9 | 581122.6 |
| 003400 | 612525.2 | 600736.8 |
| 003500 | 608806.5 | 586739.3 |
| 003600 | 574679.3 | 558455.0 |
| 004100 | 844035.1 | 829986.7 |
| 004200 | 851709.7 | 871522.8 |
| 004301 | 949268.5 | 975664.5 |
| 004302 | 765597.2 | 736649.1 |
| 004400 | 959381.9 | 984070.9 |
| 004500 | 682901.9 | 672742.7 |
| 004600 | 708752.6 | 693552.4 |
| 004700 | 593587.4 | 556556.1 |
| 004800 | 606356.0 | 593047.2 |
| 004900 | 593758.6 | 576996.9 |
| 005000 | 594657.1 | 586358.8 |
| 005100 | 670051.5 | 661225.2 |
| 005200 | 629665.5 | 616741.9 |
| 005301 | 918062.8 | 924991.7 |
| 005302 | 770062.7 | 762283.3 |
| 005400 | 653737.7 | 649910.8 |
| 005600 | 747574.6 | 729768.9 |
| 005700 | 846099.3 | 848519.1 |
| 005801 | 766564.0 | 728406.9 |
| 005802 | 884284.6 | 860121.5 |
| 006000 | 801552.2 | 803461.5 |
| 006100 | 943774.1 | 975929.8 |
| 006200 | 994527.9 | 1077813.9 |
| 006300 | 1140166.3 | 1231542.3 |
| 006400 | 997393.9 | 1056110.0 |
| 006500 | 926467.2 | 965677.4 |
| 006600 | 844714.2 | 862680.0 |
| 006700 | 862169.2 | 886832.3 |
| 006800 | 856992.2 | 907942.9 |
| 007000 | 870329.3 | 875688.6 |
| 007100 | 875535.4 | 909261.9 |
| 007200 | 803389.7 | 787635.2 |
| 007300 | 883930.6 | 922821.2 |
| 007401 | 832335.9 | 852402.1 |
| 007402 | 724870.0 | 701701.4 |
| 007500 | 967267.2 | 1010869.7 |
| 007600 | 1045993.4 | 1098221.6 |
| 007700 | 1025071.7 | 1077597.8 |
| 007800 | NA | 788339.8 |
| 008001 | 917542.2 | 944048.0 |
| 008900 | 601689.1 | 596845.6 |
| 009000 | NA | 535274.4 |
| 009100 | NA | 590712.0 |
| 009300 | 564423.8 | 557867.1 |
| 009400 | NA | 613868.2 |
| 009500 | NA | 556950.2 |
| 009900 | 295241.8 | 316194.0 |
| 010001 | NA | 513216.3 |
| 010002 | 590506.3 | 596202.1 |
| 010100 | NA | 522958.2 |
| 010300 | NA | 331494.8 |
| 010401 | NA | 368135.9 |
| 010402 | NA | 352543.6 |
| 010500 | 612886.1 | 600914.6 |
| 010600 | 594679.2 | 594630.6 |
| 010701 | 323077.8 | 326578.1 |
| 010800 | 302975.9 | 316000.2 |
| 010900 | NA | 369658.6 |
| 011001 | NA | 358692.5 |
| 011002 | NA | 373160.8 |
| 011200 | NA | 348052.2 |
| 011300 | NA | 335767.8 |
| 011401 | 316696.3 | 328247.0 |
| 011402 | 316880.7 | 319770.0 |
| 011500 | NA | 542428.8 |
| 011600 | 559453.5 | 544531.2 |
| 011700 | NA | 322258.2 |
| 011900 | 286875.9 | 298848.9 |
| 012000 | 426077.8 | 427345.1 |
| 012100 | 430088.0 | 423636.0 |
| 020100 | 697313.3 | 673145.2 |
| 020200 | 474660.2 | 472187.6 |
| 020300 | 394008.5 | 395403.1 |
| 020401 | 429304.8 | 411798.3 |
| 020402 | 433415.7 | 421201.3 |
| 020500 | 424039.1 | 422788.5 |
| 020600 | 389114.6 | 392086.4 |
| 020700 | 376172.3 | 385195.3 |
| 020800 | 479249.5 | 479535.4 |
| 020900 | 476539.6 | 478707.4 |
| 021000 | 395940.6 | 395076.5 |
| 021100 | 441742.7 | 454439.7 |
| 021300 | 464069.9 | 446132.1 |
| 021400 | 408206.5 | 425368.8 |
| 021500 | 452796.9 | 458409.8 |
| 025302 | 294087.9 | 293206.9 |
| 026001 | 314717.9 | 314298.3 |
| 026002 | 364007.6 | 353038.4 |
| 026100 | 288441.2 | 301148.1 |
| 026200 | 263599.5 | 270045.9 |
| 026300 | NA | 304962.9 |
| 026400 | NA | 282187.4 |
| 026500 | NA | 312658.5 |
| 026600 | 283887.1 | 304919.6 |
| 026700 | 405626.1 | 407822.0 |
| 026801 | 271963.6 | 284599.6 |
| 026802 | 284854.8 | 300743.9 |
| 027000 | 207912.1 | 234381.5 |
| 027100 | 230445.2 | 249655.2 |
| 027200 | 261838.3 | 259760.2 |
| 027300 | 238187.9 | 248090.7 |
| 027400 | 273473.3 | 262514.2 |
| 027500 | 313622.3 | 330425.3 |
| 027600 | 408501.2 | 390377.8 |
| 027800 | 443406.1 | 415567.1 |
| 027900 | 327809.2 | 331005.1 |
| 028000 | 219655.5 | 244000.6 |
| 028100 | 242268.1 | 256504.5 |
| 028200 | 209190.7 | 242897.5 |
| 028300 | NA | 311186.8 |
| 028402 | 280741.7 | 280661.3 |
| 028403 | 288540.7 | 277701.9 |
| 028500 | 317727.6 | 302086.8 |
| 028600 | NA | 405907.3 |
| 028700 | NA | 292581.7 |
| 028801 | NA | 312876.0 |
| 028802 | NA | 299605.8 |
| 028901 | NA | 281782.5 |
| 028902 | NA | 284013.2 |
| 029001 | NA | 268994.0 |
| 029003 | NA | 341979.8 |
| 029004 | NA | 315470.2 |
| 029101 | NA | 329347.8 |
| 030003 | NA | 301543.6 |
| 030100 | NA | 282832.9 |
The table shows that the model’s predictions for housing prices by census tract are reasonably close to the actual mean prices, suggesting a good model fit for the tracts.
Let’s try to run the regression again, but this time with a neighborhood fixed effect. We do this by simply adding the variable ‘TRACT.x’ which is the census tract to the regression model as a neighborhood-level dummy variable.
# This code fits a linear regression model ('reg.nhood') to predict 'SalePrice' using various predictors, including neighborhood-related variables, using the training dataset.
# Fit a linear regression model to the training dataset ('boston.training')
reg.nhood <- lm(price ~ ., #adding dummy variable for name
data = as.data.frame(seattle.training) %>%
dplyr::select(TRACT.x, price, pricePerFt2, sqft_living, condition,
bedrooms, buildingAge, dist2NearestPS, dist2NearestSchool, dist2NearestTransitStop,
medIncome, PctWhite, PctBlack, PctHispanic, PctPoverty, PctBachelors))
#View the model stats
summary(reg.nhood)
##
## Call:
## lm(formula = price ~ ., data = as.data.frame(seattle.training) %>%
## dplyr::select(TRACT.x, price, pricePerFt2, sqft_living, condition,
## bedrooms, buildingAge, dist2NearestPS, dist2NearestSchool,
## dist2NearestTransitStop, medIncome, PctWhite, PctBlack,
## PctHispanic, PctPoverty, PctBachelors))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1294060 -37422 11284 45273 3089251
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -587768.43258 9688.91031 -60.664
## TRACT.x000200 7497.82983 8746.97058 0.857
## TRACT.x000300 12166.16949 8743.37409 1.391
## TRACT.x000401 2163.37137 7927.91809 0.273
## TRACT.x000402 1932.95743 7956.12161 0.243
## TRACT.x000500 -7343.94131 10641.82080 -0.690
## TRACT.x000600 11091.87196 8869.56425 1.251
## TRACT.x001200 11403.81362 8759.67422 1.302
## TRACT.x001300 -2035.88546 7493.84125 -0.272
## TRACT.x001400 -441.25693 7898.16523 -0.056
## TRACT.x001500 -10203.70048 10362.78131 -0.985
## TRACT.x001600 -11409.56976 10368.42922 -1.100
## TRACT.x001701 -1196.63958 7508.02690 -0.159
## TRACT.x001702 -4739.17665 7274.40730 -0.651
## TRACT.x001800 -11932.07988 8376.47042 -1.424
## TRACT.x001900 -3570.30428 7484.16599 -0.477
## TRACT.x002700 -13188.63745 8358.68880 -1.578
## TRACT.x002800 -14292.73324 8352.25569 -1.711
## TRACT.x002900 -11693.11229 8330.26234 -1.404
## TRACT.x003200 -40450.83615 9897.58854 -4.087
## TRACT.x003300 -37653.75172 10042.43058 -3.749
## TRACT.x003400 -23554.12225 7753.63308 -3.038
## TRACT.x003500 -20284.66536 7786.14326 -2.605
## TRACT.x003600 -11812.63656 8303.63656 -1.423
## TRACT.x004100 16663.58785 10847.31437 1.536
## TRACT.x004200 4152.85460 10894.41387 0.381
## TRACT.x004301 4346.02930 10594.09535 0.410
## TRACT.x004302 21895.66749 10852.50405 2.018
## TRACT.x004400 1578.83690 10905.23313 0.145
## TRACT.x004500 -6046.42985 7883.19374 -0.767
## TRACT.x004600 -4610.63033 7817.51365 -0.590
## TRACT.x004700 -38707.51015 10069.53026 -3.844
## TRACT.x004800 -23162.50274 7768.08120 -2.982
## TRACT.x004900 -11538.67725 8408.14190 -1.372
## TRACT.x005000 -14093.59619 8240.03469 -1.710
## TRACT.x005100 -6522.36405 7837.47921 -0.832
## TRACT.x005200 -4800.81499 7858.52396 -0.611
## TRACT.x005301 8717.20030 10892.99011 0.800
## TRACT.x005302 14671.13736 10800.91943 1.358
## TRACT.x005400 -7957.50530 7832.58971 -1.016
## TRACT.x005600 -24413.15588 10100.50497 -2.417
## TRACT.x005700 -31948.14418 10014.81964 -3.190
## TRACT.x005801 -18948.22314 9853.42604 -1.923
## TRACT.x005802 -23063.15213 9932.91888 -2.322
## TRACT.x006000 33379.40523 13858.45909 2.409
## TRACT.x006100 72809.16302 9338.36068 7.797
## TRACT.x006200 45993.06457 9337.45354 4.926
## TRACT.x006300 66553.89507 9960.07822 6.682
## TRACT.x006400 73207.55065 10013.93562 7.311
## TRACT.x006500 60005.77517 9386.28574 6.393
## TRACT.x006600 41639.49000 11175.22490 3.726
## TRACT.x006700 26929.06002 14337.21519 1.878
## TRACT.x006800 14578.30620 14631.62823 0.996
## TRACT.x007000 31834.50682 13729.70312 2.319
## TRACT.x007100 25446.01960 13668.16331 1.862
## TRACT.x007200 37349.48677 13731.76427 2.720
## TRACT.x007300 22601.58455 13979.91846 1.617
## TRACT.x007401 52138.00631 14170.40803 3.679
## TRACT.x007402 61710.96552 13981.14468 4.414
## TRACT.x007500 68710.59324 9242.42223 7.434
## TRACT.x007600 75111.24406 10150.40542 7.400
## TRACT.x007700 76134.76020 10305.60430 7.388
## TRACT.x007800 36192.22584 8470.38246 4.273
## TRACT.x008001 26367.13154 13917.70559 1.895
## TRACT.x008900 1924.17400 9819.36029 0.196
## TRACT.x009000 4414.09836 9857.44903 0.448
## TRACT.x009100 2768.56341 9548.54299 0.290
## TRACT.x009300 -4874.01094 8679.22617 -0.562
## TRACT.x009400 167.86165 9901.74969 0.017
## TRACT.x009500 1768.98398 9820.39006 0.180
## TRACT.x009900 16632.61843 9533.24700 1.745
## TRACT.x010001 -3649.57012 8814.96699 -0.414
## TRACT.x010002 -2836.76629 9964.74677 -0.285
## TRACT.x010100 -8034.87066 8813.43348 -0.912
## TRACT.x010300 -20097.83744 12126.43715 -1.657
## TRACT.x010401 -14333.51586 11817.79684 -1.213
## TRACT.x010402 -9117.05304 11930.47518 -0.764
## TRACT.x010500 -12353.53333 9948.45263 -1.242
## TRACT.x010600 -15079.65043 10081.28864 -1.496
## TRACT.x010701 21786.61389 9501.29754 2.293
## TRACT.x010800 18868.14921 9451.31018 1.996
## TRACT.x010900 -16073.72141 11945.75835 -1.346
## TRACT.x011001 -6331.80543 12125.89013 -0.522
## TRACT.x011002 -14235.69075 11904.38256 -1.196
## TRACT.x011200 11216.50478 8605.31232 1.303
## TRACT.x011300 9696.82743 8621.20152 1.125
## TRACT.x011401 19375.43260 9422.35924 2.056
## TRACT.x011402 22152.63229 9625.75141 2.301
## TRACT.x011500 -12479.12852 9946.61917 -1.255
## TRACT.x011600 -12025.47879 9948.20976 -1.209
## TRACT.x011700 -1574.14519 9090.87612 -0.173
## TRACT.x011900 5257.72629 10433.04208 0.504
## TRACT.x012000 2438.57439 8357.95538 0.292
## TRACT.x012100 4925.49121 8320.32679 0.592
## TRACT.x020100 -9896.67126 10278.35946 -0.963
## TRACT.x020200 3002.42543 7948.13724 0.378
## TRACT.x020300 6793.79280 7580.89216 0.896
## TRACT.x020401 6671.78824 8715.83357 0.765
## TRACT.x020402 5875.16636 8818.18117 0.666
## TRACT.x020500 1998.32237 8667.18138 0.231
## TRACT.x020600 6087.46471 7602.37338 0.801
## TRACT.x020700 9101.65472 8670.35970 1.050
## TRACT.x020800 1526.50496 8022.46097 0.190
## TRACT.x020900 1178.54606 7911.63219 0.149
## TRACT.x021000 7783.03576 7609.82431 1.023
## TRACT.x021100 -2755.93033 8772.58360 -0.314
## TRACT.x021300 7948.53051 8859.42072 0.897
## TRACT.x021400 -4699.46215 8845.42803 -0.531
## TRACT.x021500 -540.17423 8792.10231 -0.061
## TRACT.x025302 9823.13388 10272.87997 0.956
## TRACT.x026001 9171.97694 10524.95666 0.871
## TRACT.x026002 12519.53909 10366.22100 1.208
## TRACT.x026100 5332.08679 10335.23482 0.516
## TRACT.x026200 19581.48426 8250.17309 2.373
## TRACT.x026300 15391.05711 8130.76031 1.893
## TRACT.x026400 20204.21451 8893.83683 2.272
## TRACT.x026500 13597.80259 7512.06119 1.810
## TRACT.x026600 15944.06787 8136.72426 1.960
## TRACT.x026700 19481.73014 9759.73370 1.996
## TRACT.x026801 27018.84564 8465.87589 3.192
## TRACT.x026802 25959.96548 8457.60218 3.069
## TRACT.x027000 33988.75967 10198.16140 3.333
## TRACT.x027100 35889.40921 10143.78570 3.538
## TRACT.x027200 42301.47230 10229.56225 4.135
## TRACT.x027300 39201.42459 10319.93026 3.799
## TRACT.x027400 43887.45810 10100.07547 4.345
## TRACT.x027500 11939.13334 7854.18627 1.520
## TRACT.x027600 10538.66810 8411.29149 1.253
## TRACT.x027800 -4096.35047 10141.65538 -0.404
## TRACT.x027900 16248.34592 8355.67215 1.945
## TRACT.x028000 32221.42194 9769.24185 3.298
## TRACT.x028100 25460.40832 9067.79997 2.808
## TRACT.x028200 19418.47481 9183.56587 2.114
## TRACT.x028300 28482.07469 9442.25383 3.016
## TRACT.x028402 12616.44810 11447.19068 1.102
## TRACT.x028403 8724.99811 12666.16774 0.689
## TRACT.x028500 16651.57647 11637.56396 1.431
## TRACT.x028600 13841.39964 8576.50899 1.614
## TRACT.x028700 27088.38690 10364.27100 2.614
## TRACT.x028801 23746.64851 9155.01250 2.594
## TRACT.x028802 24992.89726 9498.50409 2.631
## TRACT.x028901 20342.48814 10695.79518 1.902
## TRACT.x028902 31763.67550 10860.25952 2.925
## TRACT.x029001 23655.11311 10627.77825 2.226
## TRACT.x029003 29504.42954 10655.20134 2.769
## TRACT.x029004 31274.82437 10665.39045 2.932
## TRACT.x029101 31773.03458 10729.76725 2.961
## TRACT.x030003 32682.71714 10771.80852 3.034
## TRACT.x030100 23662.68895 10777.81762 2.195
## pricePerFt2 1879.16700 6.84914 274.365
## sqft_living 337.88209 0.86112 392.376
## condition 2549.84391 815.74246 3.126
## bedrooms -7314.94129 565.58353 -12.933
## buildingAge -72.04554 18.39781 -3.916
## dist2NearestPS 598634.16099 59734.83411 10.022
## dist2NearestSchool -245105.13298 42928.50873 -5.710
## dist2NearestTransitStop 0.43600 3.25772 0.134
## medIncome 0.11202 0.02556 4.382
## PctWhite -269.33630 71.29642 -3.778
## PctBlack -651.23568 136.65018 -4.766
## PctHispanic 450.75459 122.80427 3.671
## PctPoverty -103.84779 146.47740 -0.709
## PctBachelors -1549.06101 119.17955 -12.998
## Pr(>|t|)
## (Intercept) < 0.0000000000000002 ***
## TRACT.x000200 0.391343
## TRACT.x000300 0.164089
## TRACT.x000401 0.784947
## TRACT.x000402 0.808043
## TRACT.x000500 0.490133
## TRACT.x000600 0.211103
## TRACT.x001200 0.192973
## TRACT.x001300 0.785874
## TRACT.x001400 0.955447
## TRACT.x001500 0.324802
## TRACT.x001600 0.271157
## TRACT.x001701 0.873369
## TRACT.x001702 0.514736
## TRACT.x001800 0.154315
## TRACT.x001900 0.633330
## TRACT.x002700 0.114610
## TRACT.x002800 0.087043 .
## TRACT.x002900 0.160417
## TRACT.x003200 0.00004378204798116 ***
## TRACT.x003300 0.000177 ***
## TRACT.x003400 0.002384 **
## TRACT.x003500 0.009184 **
## TRACT.x003600 0.154863
## TRACT.x004100 0.124497
## TRACT.x004200 0.703063
## TRACT.x004301 0.681638
## TRACT.x004302 0.043642 *
## TRACT.x004400 0.884887
## TRACT.x004500 0.443084
## TRACT.x004600 0.555339
## TRACT.x004700 0.000121 ***
## TRACT.x004800 0.002867 **
## TRACT.x004900 0.169970
## TRACT.x005000 0.087202 .
## TRACT.x005100 0.405299
## TRACT.x005200 0.541265
## TRACT.x005301 0.423565
## TRACT.x005302 0.174368
## TRACT.x005400 0.309659
## TRACT.x005600 0.015652 *
## TRACT.x005700 0.001423 **
## TRACT.x005801 0.054485 .
## TRACT.x005802 0.020243 *
## TRACT.x006000 0.016018 *
## TRACT.x006100 0.00000000000000648 ***
## TRACT.x006200 0.00000084360735197 ***
## TRACT.x006300 0.00000000002381923 ***
## TRACT.x006400 0.00000000000027019 ***
## TRACT.x006500 0.00000000016425353 ***
## TRACT.x006600 0.000195 ***
## TRACT.x006700 0.060351 .
## TRACT.x006800 0.319082
## TRACT.x007000 0.020418 *
## TRACT.x007100 0.062652 .
## TRACT.x007200 0.006532 **
## TRACT.x007300 0.105946
## TRACT.x007401 0.000234 ***
## TRACT.x007402 0.00001017620927941 ***
## TRACT.x007500 0.00000000000010692 ***
## TRACT.x007600 0.00000000000013861 ***
## TRACT.x007700 0.00000000000015183 ***
## TRACT.x007800 0.00001934114154838 ***
## TRACT.x008001 0.058164 .
## TRACT.x008900 0.844645
## TRACT.x009000 0.654305
## TRACT.x009100 0.771859
## TRACT.x009300 0.574410
## TRACT.x009400 0.986474
## TRACT.x009500 0.857048
## TRACT.x009900 0.081044 .
## TRACT.x010001 0.678862
## TRACT.x010002 0.775890
## TRACT.x010100 0.361951
## TRACT.x010300 0.097454 .
## TRACT.x010401 0.225183
## TRACT.x010402 0.444763
## TRACT.x010500 0.214333
## TRACT.x010600 0.134711
## TRACT.x010701 0.021852 *
## TRACT.x010800 0.045901 *
## TRACT.x010900 0.178451
## TRACT.x011001 0.601553
## TRACT.x011002 0.231766
## TRACT.x011200 0.192431
## TRACT.x011300 0.260694
## TRACT.x011401 0.039757 *
## TRACT.x011402 0.021374 *
## TRACT.x011500 0.209627
## TRACT.x011600 0.226743
## TRACT.x011700 0.862529
## TRACT.x011900 0.614299
## TRACT.x012000 0.770466
## TRACT.x012100 0.553865
## TRACT.x020100 0.335620
## TRACT.x020200 0.705616
## TRACT.x020300 0.370165
## TRACT.x020401 0.443990
## TRACT.x020402 0.505251
## TRACT.x020500 0.817656
## TRACT.x020600 0.423291
## TRACT.x020700 0.293841
## TRACT.x020800 0.849091
## TRACT.x020900 0.881583
## TRACT.x021000 0.306426
## TRACT.x021100 0.753406
## TRACT.x021300 0.369625
## TRACT.x021400 0.595222
## TRACT.x021500 0.951010
## TRACT.x025302 0.338966
## TRACT.x026001 0.383513
## TRACT.x026002 0.227159
## TRACT.x026100 0.605917
## TRACT.x026200 0.017626 *
## TRACT.x026300 0.058372 .
## TRACT.x026400 0.023109 *
## TRACT.x026500 0.070282 .
## TRACT.x026600 0.050058 .
## TRACT.x026700 0.045925 *
## TRACT.x026801 0.001416 **
## TRACT.x026802 0.002146 **
## TRACT.x027000 0.000860 ***
## TRACT.x027100 0.000403 ***
## TRACT.x027200 0.00003552262591508 ***
## TRACT.x027300 0.000146 ***
## TRACT.x027400 0.00001393989436685 ***
## TRACT.x027500 0.128493
## TRACT.x027600 0.210241
## TRACT.x027800 0.686278
## TRACT.x027900 0.051830 .
## TRACT.x028000 0.000974 ***
## TRACT.x028100 0.004990 **
## TRACT.x028200 0.034479 *
## TRACT.x028300 0.002559 **
## TRACT.x028402 0.270405
## TRACT.x028403 0.490926
## TRACT.x028500 0.152481
## TRACT.x028600 0.106562
## TRACT.x028700 0.008961 **
## TRACT.x028801 0.009494 **
## TRACT.x028802 0.008510 **
## TRACT.x028901 0.057188 .
## TRACT.x028902 0.003449 **
## TRACT.x029001 0.026033 *
## TRACT.x029003 0.005625 **
## TRACT.x029004 0.003366 **
## TRACT.x029101 0.003066 **
## TRACT.x030003 0.002414 **
## TRACT.x030100 0.028133 *
## pricePerFt2 < 0.0000000000000002 ***
## sqft_living < 0.0000000000000002 ***
## condition 0.001774 **
## bedrooms < 0.0000000000000002 ***
## buildingAge 0.00009016173913804 ***
## dist2NearestPS < 0.0000000000000002 ***
## dist2NearestSchool 0.00000001139074530 ***
## dist2NearestTransitStop 0.893533
## medIncome 0.00001180032790292 ***
## PctWhite 0.000158 ***
## PctBlack 0.00000188741787196 ***
## PctHispanic 0.000242 ***
## PctPoverty 0.478348
## PctBachelors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 112800 on 47495 degrees of freedom
## (238 observations deleted due to missingness)
## Multiple R-squared: 0.8962, Adjusted R-squared: 0.8959
## F-statistic: 2532 on 162 and 47495 DF, p-value: < 0.00000000000000022
#r squared went up to 92% when neighbourhood was taken into consideration
# Create a new dataset ('seattle.test.nhood') by predicting 'price' for the test dataset ('seattle.test') using the neighborhood effects model ('reg.nhood')
seattle.test.nhood <- #doing exactly the same thing, but for this new model now
seattle.test %>%
mutate(Regression = "Neighborhood Effects", # Add a new variable indicating the type of regression model
SalePrice.Predict = predict(reg.nhood, seattle.test), # Predict 'SalePrice' using the fitted model
SalePrice.Error = SalePrice.Predict - price, # Calculate the error between predicted and actual 'SalePrice'
SalePrice.AbsError = abs(SalePrice.Predict - price), # Calculate the absolute error
SalePrice.APE = (abs(SalePrice.Predict - price)) / price) %>% # Calculate the absolute percentage error
filter(price < 5000000) # Filter out observations with 'SalePrice' greater than $5,000,000
The regression with neighborhood effects demonstrates that accounting for neighborhood-level characteristics can impact the model’s predictive power, as indicated by the high R-squared value.
The map aims to depict average residuals by census tracts to assess the model’s prediction errors across different areas of Seattle. However, due to a visualization issue, the map currently shows all areas as black, indicating a problem with the scaling or classification of the residuals into different color ranges,which we were not able to resolve, thus preventing us from drawing any accurate inferences about the spatial distribution of residuals.
coords <- st_coordinates(seattleData)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
seattleData$lagPrice <- lag.listw(spatialWeights, seattleData$price)
# Extract the coordinates of the test dataset
coords.test <- st_coordinates(seattle.test) #calculating spatial lag of errors to see if our model is misspecified
# Create a neighbor list using k-nearest neighbors (KNN) with k=5 for the test dataset
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
# Convert the neighbor list to a spatial weights matrix for the test dataset
spatialWeights.test <- nb2listw(neighborList.test, style="W")
seattle.test$lagPrice <- lag.listw(spatialWeights.test, seattle.test$price)
seattle.test$lagPriceError <- lag.listw(spatialWeights.test, seattle.test$SalePrice.Error, NAOK = TRUE)
seattle.test <- seattle.test %>%
mutate(Residual = price - SalePrice.Predict) %>%
group_by(TRACT.x) %>%
summarise(AvgResidual = mean(Residual, na.rm = TRUE))
# Ensure that kcTracts_filtered is an sf object and has the correct CRS
kcTracts_filtered <- st_as_sf(kcTracts_filtered)
kcTracts_filtered <- st_set_crs(kcTracts_filtered, st_crs(seattle.test))
# Perform the spatial join
seattle_residuals <- st_join(seattle.test, kcTracts_filtered)
# Create the color palette for the gradient
color_palette <- colorRampPalette(colors = c("#d7191c", "#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6"))
# Generate a sequence of colors from the palette
palette <- color_palette(100) # This will create a smooth gradient with 100 colors
# Plot the kcTracts_filtered as the base layer in grey
ggplot() +
geom_sf(data = kcTracts_filtered, fill = "grey89") + # Base layer
geom_sf(data = seattle_residuals, aes(fill = AvgResidual), size = 0.5) + # Residuals layer
scale_fill_gradientn(colors = palette) + # Gradient scale for residuals
labs(title = "Average Residuals by Census Tracts", fill = "Avg Residual") +
theme_void()
From this analysis, one can infer that housing prices in Seattle are influenced by a combination of property characteristics, demographic factors, and proximity to amenities, and there is variability in how well the model’s predictions align with actual sale prices across different areas. This suggests that while the model captures broad trends, local factors and potential data issues may cause variations in prediction accuracy, which would be important for stakeholders to consider.